Smart Paper Notes

Predicting pathways for old and new metabolites through clustering

Thiru Siddharth, Nathan Lewis

Category: Machine Learning

2022-11-28

The diverse metabolic pathways are fundamental to all living organisms, as they harvest energy, synthesize biomass components, produce molecules to interact with the microenvironment, and neutralize toxins. While discovery of new metabolites and pathways continues, the prediction of pathways for new metabolites can be challenging. It can take vast amounts of time to elucidate pathways for new metabolites; thus, according to HMDB only 60% of metabolites get assigned to pathways. Here, we present an approach to identify pathways based on metabolite structure. We extracted 201 features from SMILES annotations, and identified new metabolites from PubMed abstracts and HMDB. After applying clustering algorithms to both groups of features, we quantified correlations between metabolites, and found the clusters accurately linked 92% of known metabolites to their respective pathways. Thus, this approach could be valuable for predicting metabolic pathways for new metabolites.
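
The clustering idea in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the abstract does not name the clustering algorithm or list the 201 features, so the code below substitutes plain k-means on three made-up numeric features, and the pathway labels are purely illustrative. A new metabolite is assigned the majority pathway of the annotated metabolites in its cluster.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means with deterministic farthest-point seeding (an assumed stand-in)."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next seed: the point farthest from its nearest existing centroid.
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels


def predict_pathway(labels, known_pathways, query_index):
    """Majority pathway among annotated metabolites sharing the query's cluster."""
    votes = {}
    for i, pathway in known_pathways.items():
        if labels[i] == labels[query_index]:
            votes[pathway] = votes.get(pathway, 0) + 1
    return max(votes, key=votes.get) if votes else None


# Hypothetical feature vectors; real features would be derived from SMILES strings.
features = [
    (0.0, 0.0, 2.0),  # index 0: annotated
    (0.0, 1.0, 2.0),  # index 1: annotated
    (2.0, 4.0, 0.0),  # index 2: annotated
    (2.0, 5.0, 0.0),  # index 3: annotated
    (2.0, 4.0, 1.0),  # index 4: new metabolite without a pathway
]
known = {
    0: "citric acid cycle",
    1: "citric acid cycle",
    2: "purine metabolism",
    3: "purine metabolism",
}
labels = kmeans(features, k=2)
prediction = predict_pathway(labels, known, query_index=4)
print(labels, prediction)
```

Majority voting within a cluster is only one way to turn cluster membership into a pathway prediction; the abstract itself speaks of quantifying correlations between metabolites.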

Translated by Google Translate

Plant Species Classification Using Transfer Learning by Pretrained Classifier VGG-19

Thiru Siddharth, Bhupendra Singh Kirar, Dheeraj Kumar Agrawal

Category: Computer Vision | (Statistical) Machine Learning

2022-09-07

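Deep learning is currently the most important branch of machine learning, with applications in speech recognition, computer vision, image classification, and medical imaging analysis. Plant recognition is one of the areas where image classification can be used to identify plant species from their leaves. Botanists spend a great deal of time identifying plant species by inspecting them in person. This paper describes a method for analyzing color images from the Swedish leaf dataset and identifying the plant species. To achieve higher accuracy, the task is carried out using transfer learning with the pretrained classifier VGG-19. The four main stages of classification are image preprocessing, image augmentation, feature extraction, and recognition, which are performed as part of the overall model evaluation. The VGG-19 classifier learns the characteristics of leaves through its predefined hidden layers (convolutional layers, max-pooling layers, and fully connected layers) and finally uses a softmax layer to generate a feature representation for all plant classes. The model learns from the Swedish leaf dataset, which contains 15 tree classes, and predicts the correct class of an unknown plant with an accuracy of 99.70%, which is higher than previously reported results.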

Linear features segmentation from aerial images

Zhipeng Chang, Siddharth Jha, Yunfei Xia

Category: Computer Vision | Artificial Intelligence

2022-12-23

Remote sensing technologies have developed rapidly and gained significant attention due to their ability to accurately localize, classify, and segment objects in aerial images. These technologies are commonly deployed on unmanned aerial vehicles (UAVs) equipped with high-resolution cameras or sensors to capture data over large areas. This data is useful for applications such as monitoring and inspecting cities, towns, and terrain. In this paper, we present a method for classifying and segmenting the dashed traffic lines on city roads from aerial images using deep learning models such as U-Net and SegNet. Annotated data is used to train these models, which then segment each aerial image into two classes: dashed lines and non-dashed lines. However, the deep learning model may fail to identify some dashed lines because of poor painting or occlusion by trees or shadows. To address this issue, we propose a method that adds the missed lines to the segmentation output. We also extract the x and y coordinates of each dashed line from the segmentation output, which city planners can use to construct a CAD file for digital visualization of the roads.
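
The coordinate-extraction step can be illustrated with a minimal sketch. The abstract does not say how coordinates are derived from the segmentation output, so the code below assumes a binary mask (1 for dashed line, 0 otherwise) and reports the centroid of each 4-connected region; the function name and the toy mask are hypothetical.

```python
from collections import deque


def dashed_line_centroids(mask):
    """Return the (x, y) centroid of each 4-connected region of 1s in a binary mask."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    centroids = []
    for y0 in range(height):
        for x0 in range(width):
            if mask[y0][x0] != 1 or seen[y0][x0]:
                continue
            # Breadth-first flood fill collects every pixel of this region.
            queue = deque([(x0, y0)])
            seen[y0][x0] = True
            pixels = []
            while queue:
                x, y = queue.popleft()
                pixels.append((x, y))
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    inside = 0 <= nx < width and 0 <= ny < height
                    if inside and mask[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            cx = sum(p[0] for p in pixels) / len(pixels)
            cy = sum(p[1] for p in pixels) / len(pixels)
            centroids.append((cx, cy))
    return centroids


# Toy segmentation output with two short dashes (made-up data).
mask = [
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
]
coords = dashed_line_centroids(mask)
print(coords)
```

Pixel coordinates like these would still need to be mapped to real-world units before they could go into a CAD file; that mapping depends on the imaging setup and is not covered here.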

On Event Individuation for Document-Level Information Extraction

William Gantt, Reno Kriz, Yunmo Chen, Siddharth Vashishtha, Aaron Steven White

Category: Natural Language Processing | Artificial Intelligence | Machine Learning

2022-12-19

As information extraction (IE) systems have grown more capable at whole-document extraction, the classic task of template filling has seen renewed interest as a benchmark for evaluating them. In this position paper, we call into question the suitability of template filling for this purpose. We argue that the task demands definitive answers to thorny questions of event individuation (the problem of distinguishing distinct events), about which even human experts disagree. We show through annotation studies and error analysis that this raises concerns about the usefulness of template filling evaluation metrics, the quality of datasets for the task, and the ability of models to learn it. Finally, we consider possible solutions.

Predicting Citi Bike Demand Evolution Using Dynamic Graphs

Alexander Saff, Mayur Bhandary, Siddharth Srivastava

Category: Machine Learning

2022-12-18

Bike sharing systems often suffer from poor capacity management as a result of variable demand. These systems would benefit from models that predict demand, so that operators can adjust the number of bikes stored at each station. In this paper, we apply a graph neural network model to predict bike demand in the New York City Citi Bike dataset.

An Upper Bound for the Distribution Overlap Index and Its Applications

Hao Fu, Prashanth Krishnamurthy, Siddharth Garg, Farshad Khorrami

Category: Machine Learning

2022-12-16

This paper proposes an easy-to-compute upper bound for the overlap index between two probability distributions without requiring any knowledge of the distribution models. The computation of our bound is time-efficient and memory-efficient and only requires finite samples. The proposed bound shows its value in one-class classification and domain shift analysis. Specifically, in one-class classification, we build a novel one-class classifier by converting the bound into a confidence score function. Unlike most one-class classifiers, our classifier requires no training process. Additionally, the experimental results show that our classifier can be accurate with only a small number of in-class samples and outperforms many state-of-the-art methods on various datasets in different one-class classification scenarios. In domain shift analysis, we propose a theorem based on our bound. The theorem is useful in detecting the existence of domain shift and inferring data information. The detection and inference processes are both computation-efficient and memory-efficient. Our work shows significant promise toward broadening the applications of overlap-based metrics.
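
For orientation, the overlap index between two distributions is commonly defined as the integral of the pointwise minimum of their densities. The abstract does not give the authors' notation or the form of their bound, so the symbols below ($\eta$, $p$, $q$) are assumptions, and only the standard definition and its relation to total variation distance are shown.

```latex
% Overlap index of distributions P and Q with densities p and q (assumed notation).
\eta(P, Q) = \int \min\bigl(p(x),\, q(x)\bigr)\, dx , \qquad 0 \le \eta(P, Q) \le 1 .

% Since \min(a, b) = \tfrac{1}{2}\bigl(a + b - |a - b|\bigr) and both densities integrate to 1:
\eta(P, Q) = 1 - \frac{1}{2} \int \bigl| p(x) - q(x) \bigr|\, dx = 1 - \delta_{\mathrm{TV}}(P, Q) .
```

An overlap of 1 means the two distributions coincide, and an overlap of 0 means their supports are disjoint, which is why an upper bound on the overlap can serve as evidence of domain shift.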

Privacy-Preserving Collaborative Learning through Feature Extraction

Alireza Sarmadi, Hao Fu, Prashanth Krishnamurthy, Siddharth Garg, Farshad Khorrami

Category: Machine Learning

2022-12-13

We propose a framework in which multiple entities collaborate to build a machine learning model while preserving privacy of their data. The approach utilizes feature embeddings from shared/per-entity feature extractors transforming data into a feature space for cooperation between entities. We propose two specific methods and compare them with a baseline method. In Shared Feature Extractor (SFE) Learning, the entities use a shared feature extractor to compute feature embeddings of samples. In Locally Trained Feature Extractor (LTFE) Learning, each entity uses a separate feature extractor and models are trained using concatenated features from all entities. As a baseline, in Cooperatively Trained Feature Extractor (CTFE) Learning, the entities train models by sharing raw data. Secure multi-party algorithms are utilized to train models without revealing data or features in plain text. We investigate the trade-offs among SFE, LTFE, and CTFE in regard to performance, privacy leakage (using an off-the-shelf membership inference attack), and computational cost. LTFE provides the most privacy, followed by SFE, and then CTFE. Computational cost is lowest for SFE and the relative speed of CTFE and LTFE depends on network architecture. CTFE and LTFE provide the best accuracy. We use MNIST, a synthetic dataset, and a credit card fraud detection dataset for evaluations.

Uniform Masking Prevails in Vision-Language Pretraining

Siddharth Verma, Yuchen Lu, Rui Hou, Hanchao Yu, Nicolas Ballas, Madian Khabsa, Amjad Almahairi

Category: Machine Learning

2022-12-10

Masked Language Modeling (MLM) has proven to be an essential component of Vision-Language (VL) pretraining. To implement MLM, the researcher must make two design choices: the masking strategy, which determines which tokens to mask, and the masking rate, which determines how many tokens to mask. Previous work has focused primarily on the masking strategy while setting the masking rate at a default of 15%. In this paper, we show that increasing this masking rate improves downstream performance while simultaneously reducing the performance gap among different masking strategies, rendering the uniform masking strategy competitive with more complex ones. Surprisingly, we also discover that increasing the masking rate leads to gains in Image-Text Matching (ITM) tasks, suggesting that the role of MLM goes beyond language modeling in VL pretraining.
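
The two design choices can be made concrete with a minimal sketch of uniform masking, in which every token position is equally likely to be masked and the masking rate fixes how many positions are chosen. The function name, the `[MASK]` placeholder, the example sentence, and the higher rate of 0.45 are illustrative assumptions; only the 15% default comes from the abstract.

```python
import random

MASK_TOKEN = "[MASK]"  # assumed placeholder symbol


def uniform_mask(tokens, masking_rate=0.15, seed=0):
    """Mask a uniformly random subset of positions; the rate sets how many."""
    if not tokens:
        return [], []
    rng = random.Random(seed)
    # Always mask at least one token, and never more than the sequence length.
    num_to_mask = min(len(tokens), max(1, round(len(tokens) * masking_rate)))
    positions = set(rng.sample(range(len(tokens)), num_to_mask))
    masked = [MASK_TOKEN if i in positions else tok for i, tok in enumerate(tokens)]
    return masked, sorted(positions)


# Made-up 20-token caption.
tokens = (
    "a small brown dog runs across the wet grass toward a red ball "
    "while two children watch from the porch"
).split()

masked_default, positions_default = uniform_mask(tokens, masking_rate=0.15)
masked_high, positions_high = uniform_mask(tokens, masking_rate=0.45)
print(len(positions_default), len(positions_high))
```

A non-uniform strategy would change only how `positions` is chosen (for example, by weighting certain tokens more heavily), while the rate would still control the count.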

Dock2D: Synthetic data for the molecular recognition problem

Siddharth Bhadra-Lobo, Georgy Derevyanko, Guillaume Lamoureux

Category: Machine Learning

2022-12-07

Predicting the physical interaction of proteins is a cornerstone problem in computational biology. New classes of learning-based algorithms are actively being developed, and are typically trained end-to-end on protein complex structures extracted from the Protein Data Bank. These training datasets tend to be large and difficult to use for prototyping and, unlike image or natural language datasets, they are not easily interpretable by non-experts. We present Dock2D-IP and Dock2D-IF, two "toy" datasets that can be used to select algorithms predicting protein-protein interactions, or any other type of molecular interaction. Using two-dimensional shapes as input, each example from Dock2D-IP ("interaction pose") describes the interaction pose of two shapes known to interact and each example from Dock2D-IF ("interaction fact") describes whether two shapes form a stable complex or not. We propose a number of baseline solutions to the problem and show that the same underlying energy function can be learned either by solving the interaction pose task (formulated as an energy-minimization "docking" problem) or the fact-of-interaction task (formulated as a binding free energy estimation problem).

Hierarchical Termination Analysis for Generalized Planning

Siddharth Srivastava

Category: Artificial Intelligence

2022-12-06

This paper presents a new approach for analyzing and identifying potentially useful generalized plans. It presents a new conceptual framework along with an algorithmic process for assessing termination and reachability related properties of generalized plans. The presented framework builds upon classic results on the analysis of graphs to decompose generalized plans into smaller components in a novel algorithm for conducting a hierarchical analysis for termination of arbitrary generalized plans. Theoretical analysis of the new framework establishes soundness of the presented algorithms and shows how it goes beyond existing approaches; empirical analysis illustrates the scope of this approach. Our analysis shows that this new approach can effectively identify termination for a significantly larger class of generalized plans than was possible using existing methods.
